1.0  INTRODUCTION


A problem that frequently arises in practice is one where n data points,
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, are given. These points can be
plotted on an x-y graph, giving a scattergram of the data. The problem is to
find the line that gives a "best" approximation to the data, and to determine
the "quality" of this approximation. In the following sections, three
definitions of a "best" approximation are given, the corresponding solutions
for the best line are derived, and an expression for each "quality" is
determined. In the process, the relationship between this data approximation
problem and elementary statistics is also revealed.
2.0  NOTATION AND SOME PRELIMINARY RESULTS


Since the subject matter involves many sums from i = 1 to n, the following
shorthand notation will be used:

$$\sum x_i = x_1 + x_2 + \cdots + x_n$$

No limits are indicated on the summation sign $\sum$ since the limits i = 1
and i = n are understood.

The following elementary statistics about the data are needed:

$$\mu_x = \frac{1}{n} \sum x_i = \text{mean of } x_i \tag{2.1}$$

$$S_{xx} = \frac{1}{n} \sum (x_i - \mu_x)^2 = \text{variance of } x_i \tag{2.2}$$

$$\mu_y = \frac{1}{n} \sum y_i = \text{mean of } y_i \tag{2.3}$$

$$S_{yy} = \frac{1}{n} \sum (y_i - \mu_y)^2 = \text{variance of } y_i \tag{2.4}$$

$$S_{xy} = \frac{1}{n} \sum (x_i - \mu_x)(y_i - \mu_y) = \text{covariance of } x_i \text{ and } y_i \tag{2.5}$$

Some elementary algebra results in the following useful identities.

$$\sum x_i = n \mu_x \tag{2.6}$$

$$\sum x_i^2 = n S_{xx} + n \mu_x^2 \tag{2.7}$$

$$\sum y_i = n \mu_y \tag{2.8}$$

$$\sum y_i^2 = n S_{yy} + n \mu_y^2 \tag{2.9}$$

$$\sum x_i y_i = n S_{xy} + n \mu_x \mu_y \tag{2.10}$$

These identities will be referred to repeatedly in the following sections.
We shall assume that neither $S_{xx}$ nor $S_{yy}$ is zero (that is, both
$S_{xx}$ and $S_{yy}$ are positive).
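The statistics (2.1) through (2.5) and the identities that follow from them can be checked numerically. The sketch below is only an illustration: the function name `summary_stats` and the four data points are assumptions made for the example and do not come from the report.

```python
def summary_stats(xs, ys):
    """Return (mu_x, mu_y, s_xx, s_yy, s_xy) following eqns (2.1)-(2.5)."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    s_yy = sum((y - mu_y) ** 2 for y in ys) / n
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return mu_x, mu_y, s_xx, s_yy, s_xy


# Illustrative data only; these four points are not taken from the report.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
n = len(xs)
mu_x, mu_y, s_xx, s_yy, s_xy = summary_stats(xs, ys)

# Left- and right-hand sides of identities (2.7), (2.9), and (2.10).
identity_2_7 = (sum(x * x for x in xs), n * s_xx + n * mu_x ** 2)
identity_2_9 = (sum(y * y for y in ys), n * s_yy + n * mu_y ** 2)
identity_2_10 = (sum(x * y for x, y in zip(xs, ys)), n * s_xy + n * mu_x * mu_y)
print(identity_2_7, identity_2_9, identity_2_10)
```

Each printed pair should agree up to rounding, which is all the identities claim.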

Finally, many quadratics will appear. Any quadratic Q(x) of the form (2.11)

$$Q(x) = A - 2Bx + Cx^2 \tag{2.11}$$

can be written as

$$Q(x) = (A - C x_m^2) + C (x - x_m)^2$$

where

$$x_m = B / C \tag{2.12}$$

When C > 0, the quadratic Q(x) has a minimum at $x = x_m$. Its minimum
value is

$$Q(x_m) = A - C x_m^2 \tag{2.13}$$

Eqns (2.12) and (2.13) will be referred to for the minimizing argument
and minimum value of any quadratic that appears in the following sections.
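As a small worked instance of eqns (2.12) and (2.13), the following sketch minimizes one quadratic written in the form (2.11). The helper name `quadratic_minimum` and the coefficients are assumptions chosen for illustration.

```python
def quadratic_minimum(a, b, c):
    """Minimize Q(x) = a - 2*b*x + c*x**2 for c > 0 via eqns (2.12), (2.13)."""
    if c <= 0:
        raise ValueError("c must be positive for Q to have a minimum")
    x_m = b / c                   # eqn (2.12)
    q_min = a - c * x_m ** 2      # eqn (2.13)
    return x_m, q_min


# Q(x) = 7 - 2*6*x + 3*x**2 = 3*(x - 2)**2 - 5
x_m, q_min = quadratic_minimum(7.0, 6.0, 3.0)
print(x_m, q_min)  # prints 2.0 -5.0
```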
3.0  FIRST DEFINITION OF BEST LINEAR APPROXIMATION


When the x's are considered the independent variables, any line can
be written in slope-intercept form as eqn (3.1)

$$y = mx + b \tag{3.1}$$

where m is the slope and b is the y-intercept. The data points will, in
general, not satisfy this equation but will leave residuals
$r_1, r_2, \ldots, r_n$ where

$$r_i = y_i - m x_i - b, \quad i = 1, 2, \ldots, n \tag{3.2}$$

Under the first definition, the best linear approximation is that line
whose m and b minimize the sum of squared residuals G:

$$G = \sum r_i^2 \tag{3.3}$$

The optimal line is determined in the following two steps. First,
for any (fixed) slope m, the value of the y-intercept b which minimizes
G is determined. To do this, substitute (3.2) into (3.3), then expand:

$$G = \sum (y_i - m x_i - b)^2 = \sum (y_i - m x_i)^2 - 2 \sum (y_i - m x_i) b + \sum b^2 \tag{3.4}$$

Now, equations (2.6) and (2.8) reduce eqn (3.4) to eqn (3.5).

$$G = \sum (y_i - m x_i)^2 - 2n (\mu_y - m \mu_x) b + n b^2 \tag{3.5}$$

This is a quadratic in b of the form (2.11), with $B = n(\mu_y - m \mu_x)$
and $C = n$. By eqn (2.12), its minimum occurs at

$$b_m = n (\mu_y - m \mu_x) / n = \mu_y - m \mu_x \tag{3.6}$$

As an intermediate result, note that a horizontal line (constant
function) is given by m = 0. In this event, the best value for b is $\mu_y$
and, by eqn (2.9), the resulting value for G is

$$G = \sum y_i^2 - n \mu_y^2 = n S_{yy}$$

These results are summarized by Theorem 1.

Theorem 1: The best constant approximation to the $y_i$ data points is
given by $\mu_y$, the mean of the $y_i$. The "quality" of this approximation
is given by the variance of $y_i$, $S_{yy} = G/n$.
The second step is to use eqn (3.6) for the y-intercept in the
formula for G, eqn (3.4). Eqn (2.13) gives the value for G as eqn
(3.7).

$$G = \sum (y_i - m x_i)^2 - n (\mu_y - m \mu_x)^2 \tag{3.7}$$

Expanding this and using eqns (2.7), (2.9), and (2.10), eqn (3.7) reduces
to eqn (3.8):

$$G = n S_{yy} - 2 n S_{xy} m + n S_{xx} m^2 \tag{3.8}$$

This is a quadratic in m similar to eqn (2.11), hence its minimum occurs at

$$m_m = (n S_{xy}) / (n S_{xx}) = S_{xy} / S_{xx} \tag{3.9}$$

and its minimum value is:

$$G = n S_{yy} - n S_{xx} m_m^2 = n S_{yy} - n S_{xy}^2 / S_{xx} \tag{3.10}$$

Eqn (3.10) leads to the following definition of the "quality" of the
approximation:

$$R = S_{xy}^2 / (S_{xx} S_{yy}) \tag{3.11}$$

With this value for R, eqn (3.10) becomes

$$G = n S_{yy} (1 - R) \tag{3.12}$$


Since both $S_{xx}$ and $S_{yy}$ are positive, R is nonnegative. Since G is
the sum of squared terms, it is nonnegative and, by eqn (3.12), R is no
greater than one. A value of zero for R says that a constant function is the
best approximation to the y-data. A value of one for R, implying that G = 0
and every residual $r_i = 0$, occurs when the line passes through every data
point. All this is contained in:


Theorem 2: The best linear approximation to the $y_i$ data points is
given by the line (3.13)

$$y = \mu_y + S_{xy} (x - \mu_x) / S_{xx} \tag{3.13}$$

and the "quality" of this approximation is given by the quantity R in
eqn (3.11).
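Theorem 2 translates directly into a few lines of code. The sketch below takes the statistics of Section 2.0 as inputs and checks the closed form (3.12) against a direct evaluation of (3.3). The function name `fit_y_on_x` and the four data points are illustrative assumptions, not part of the report.

```python
def fit_y_on_x(mu_x, mu_y, s_xx, s_yy, s_xy):
    """First definition: return (m, b, R) for the line y = m*x + b."""
    m = s_xy / s_xx                          # eqn (3.9)
    b = mu_y - m * mu_x                      # eqn (3.6)
    quality = s_xy ** 2 / (s_xx * s_yy)      # eqn (3.11)
    return m, b, quality


# Illustrative points (not from the report). Their statistics are
# mu_x = mu_y = 2.5, S_xx = S_yy = 1.25, S_xy = 0.75.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
m, b, quality = fit_y_on_x(2.5, 2.5, 1.25, 1.25, 0.75)

# Sum of squared residuals evaluated directly from eqn (3.3) ...
g_direct = sum((y - m * x - b) ** 2 for x, y in zip(xs, ys))
# ... and from the closed form n * S_yy * (1 - R), eqn (3.12).
g_closed = len(xs) * 1.25 * (1.0 - quality)
print(m, b, quality, g_direct, g_closed)
```

The two values of G should agree to rounding error; a disagreement would point to an error in the statistics passed in.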
4.0  SECOND DEFINITION OF BEST LINEAR APPROXIMATION


The second definition of a "best" line applies when the y's are
considered the independent variables and the x's are considered the
dependent variables. This is simply an interchange of the roles of x and y
from the previous section. The derivation of the equations follows in
exactly the same order with x in place of y and y in place of x. The
results are stated here:


Theorem 3: The best constant approximation to the $x_i$ data points is
given by $\mu_x$, the mean of the $x_i$. The "quality" of this approximation
is given by the variance of $x_i$, $S_{xx} = G/n$.

Theorem 4: The best linear approximation to the $x_i$ data points is given
by the line (4.1)

$$x = \mu_x + S_{xy} (y - \mu_y) / S_{yy} \tag{4.1}$$

and the "quality" of this approximation is given by the quantity R in
eqn (4.2)

$$R = S_{xy}^2 / (S_{yy} S_{xx}) \tag{4.2}$$
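To compare the line of Theorem 4 with a line written as y in terms of x, eqn (4.1) has to be solved for y. The sketch below does both steps. The names `fit_x_on_y` and `solve_for_y` and the input statistics are assumptions made for illustration only.

```python
def fit_x_on_y(mu_x, mu_y, s_xx, s_yy, s_xy):
    """Second definition: return (slope, intercept, R) for x = intercept + slope*y."""
    slope = s_xy / s_yy                      # coefficient of y in eqn (4.1)
    intercept = mu_x - slope * mu_y
    quality = s_xy ** 2 / (s_yy * s_xx)      # eqn (4.2)
    return slope, intercept, quality


def solve_for_y(slope, intercept):
    """Rewrite x = intercept + slope*y as y = b + m*x; requires slope != 0."""
    if slope == 0:
        raise ValueError("a vertical line cannot be written as y = b + m*x")
    return 1.0 / slope, -intercept / slope


# Illustrative statistics (not from the report):
# mu_x = mu_y = 2.5, S_xx = S_yy = 1.25, S_xy = 0.75.
slope, intercept, quality = fit_x_on_y(2.5, 2.5, 1.25, 1.25, 0.75)
m, b = solve_for_y(slope, intercept)
print(slope, intercept, quality, m, b)
```

For these statistics the resulting slope m is about 1.67, whereas eqn (3.9) gives 0.6 for the same data, so the two definitions generally describe different lines even though their qualities coincide.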
5.0  GEOMETRIC BEST LINEAR APPROXIMATION

The third definition of a "best" line applies when neither the
x's nor the y's are given preference. The x's were given preference
in Section 3.0 and the y's were given preference in Section 4.0. In each
case, the distance from a point to a line, the residual, was measured
parallel to one of the axes. In this section, since neither axis can be
so chosen, the geometric distance from a point to a line must be used.

The derivation of the best geometric line is much easier if vector
notation is used. Let $P_1, P_2, \ldots, P_n$ be the data points

$$P_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix}, \quad i = 1, 2, \ldots, n \tag{5.1}$$

For any vector V, let $V^T$ denote the transpose of V.

In vector notation, any line can be represented in parametric form
as

$$L = A + Dt \tag{5.2}$$

where A and D are vectors and t is the independent parameter. For any
point P, the distance from this line to P, d(P,L), is the minimum
distance from P to points on the line. For any point on the line,
the square of this distance is $d^2$:

$$d^2 = (P - A - Dt)^T (P - A - Dt) \tag{5.3}$$

which expands to:

$$d^2 = (P - A)^T (P - A) - 2 D^T (P - A) t + D^T D t^2 \tag{5.4}$$

Eqn (5.4) is a quadratic in t similar to eqn (2.11), hence its minimum
occurs at

$$t_m = D^T (P - A) / (D^T D) \tag{5.5}$$

and its minimum value is

$$d(P,L)^2 = (P - A)^T (P - A) - D^T D t_m^2 = (P - A)^T (P - A) - [D^T (P - A)]^2 / (D^T D) \tag{5.6}$$
Since all vectors are real, $D^T (P - A) = (P - A)^T D$ and equation (5.6)
becomes

$$d(P,L)^2 = (P - A)^T (P - A) - ((P - A)^T D)(D^T (P - A)) / (D^T D) = (P - A)^T [I - D D^T / (D^T D)] (P - A) \tag{5.7}$$

where I is the identity matrix.
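The projection argument of eqns (5.5) and (5.6) can be tried on a single point. The following is a minimal sketch for two-component vectors stored as tuples. The function name `squared_distance_to_line` and the sample point and line are assumptions for illustration.

```python
def squared_distance_to_line(p, a, d):
    """Return (t_m, d(P,L)^2) for the point p and the line a + d*t."""
    px, py = p[0] - a[0], p[1] - a[1]              # P - A
    dtd = d[0] * d[0] + d[1] * d[1]                # D^T D
    if dtd == 0:
        raise ValueError("the direction vector D must be nonzero")
    dtp = d[0] * px + d[1] * py                    # D^T (P - A)
    t_m = dtp / dtd                                # eqn (5.5)
    dist2 = px * px + py * py - dtp * dtp / dtd    # eqn (5.6)
    return t_m, dist2


# The point (3, 4) and the x-axis, written as A = (0, 0), D = (2, 0).
t_m, dist2 = squared_distance_to_line((3.0, 4.0), (0.0, 0.0), (2.0, 0.0))
print(t_m, dist2)  # prints 1.5 16.0
```

The result does not depend on the length of D, which is why eqn (5.7) divides by $D^T D$.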

Given the data points $P_1, P_2, \ldots, P_n$, the "best" line L is that line
which minimizes the sum of the squared distances, G:

$$G = \sum d(P_i, L)^2 \tag{5.8}$$

Using eqn (5.7)

$$G = \sum (P_i - A)^T [I - D D^T / (D^T D)] (P_i - A) \tag{5.9}$$

This "best" line is found in two steps. For the first step, fix
the vector D and choose A so as to minimize eqn (5.9). For this purpose,
let

$$M = I - D D^T / (D^T D) \tag{5.10}$$

and

$$A = A_0 + A_1 s \tag{5.11}$$

for arbitrary vectors $A_0$ and $A_1$ with s a real number. With this form
for A, eqn (5.9) expands to

$$G = \sum (P_i - A_0)^T M (P_i - A_0) - 2 \sum A_1^T M (P_i - A_0) s + \sum A_1^T M A_1 s^2 \tag{5.12}$$

The vector $A_0$ minimizes G if the linear term in (5.12) is zero for every
vector $A_1$, i.e., if

$$\sum M (P_i - A_0) = M \left( \sum P_i - n A_0 \right) = 0 \tag{5.13}$$

While there are many vectors $A_0$ which satisfy (5.13), the obvious and
easiest to use is the centroid:

$$A_0 = \sum P_i / n \tag{5.14}$$

This result is summarized here.

Theorem 5: The best linear approximation to the data points passes
through the centroid.

When the vector A is chosen as the centroid $A_0$, the value of G, eqn
(5.9), expands to become:

$$G = \sum (P_i - A_0)^T (P_i - A_0) - \sum (P_i - A_0)^T (D D^T) (P_i - A_0) / (D^T D) \tag{5.15}$$

The value for G is a minimum when the value for the second term in eqn
(5.15) is a maximum, so attention will now focus on this second term, H.

$$H = \sum (P_i - A_0)^T (D D^T) (P_i - A_0) / (D^T D) \tag{5.16}$$

Since matrix multiplication is associative, H becomes

$$H = \sum [(P_i - A_0)^T D] [D^T (P_i - A_0)] / (D^T D) \tag{5.17}$$

Since $(P_i - A_0)^T D = D^T (P_i - A_0)$, H becomes

$$H = \sum [D^T (P_i - A_0)] [(P_i - A_0)^T D] / (D^T D) \tag{5.18}$$

Finally, using associativity once again

$$H = \sum D^T [(P_i - A_0)(P_i - A_0)^T] D / (D^T D) = D^T \left[ \sum (P_i - A_0)(P_i - A_0)^T \right] D / (D^T D) \tag{5.19}$$

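$A_0$ is the centroid so

$$(P_i - A_0)(P_i - A_0)^T = \begin{bmatrix} (x_i - \mu_x)^2 & (x_i - \mu_x)(y_i - \mu_y) \\ (y_i - \mu_y)(x_i - \mu_x) & (y_i - \mu_y)^2 \end{bmatrix} \tag{5.20}$$

and the matrix in eqn (5.19) is n times the covariance matrix:

$$\sum (P_i - A_0)(P_i - A_0)^T = \begin{bmatrix} n S_{xx} & n S_{xy} \\ n S_{xy} & n S_{yy} \end{bmatrix} \tag{5.21}$$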
The quantity H in eqn (5.19) is the Rayleigh quotient for the matrix
(5.21), which attains a maximum equal to the largest eigenvalue of
(5.21) when the vector D is a corresponding eigenvector. The theory of
eigenvalues and eigenvectors would have to be used if the vectors $P_i$ had
more than two components. In the case of only two components, let

$$D = \begin{bmatrix} \cos u \\ \sin u \end{bmatrix} \tag{5.22}$$

for some angle u. Then $D^T D = \cos^2 u + \sin^2 u = 1$ and eqn (5.19)
becomes

$$H = n \left( S_{xx} \cos^2 u + 2 S_{xy} \cos u \sin u + S_{yy} \sin^2 u \right) \tag{5.23}$$

Eqn (5.23) can be simplified by using the trigonometric identities

$$\cos 2u = 2 \cos^2 u - 1 = 1 - 2 \sin^2 u \tag{5.24}$$

and

$$\sin 2u = 2 \cos u \sin u$$

The simplified form for H becomes

$$H = n \left[ \tfrac{1}{2} S_{xx} (1 + \cos 2u) + S_{xy} \sin 2u + \tfrac{1}{2} S_{yy} (1 - \cos 2u) \right] = n \left[ \tfrac{1}{2} (S_{xx} + S_{yy}) + \tfrac{1}{2} (S_{xx} - S_{yy}) \cos 2u + S_{xy} \sin 2u \right] \tag{5.25}$$

Further simplification is achieved by defining r and $u_0$ by eqns (5.26),
(5.27), and (5.28).

$$r^2 = \tfrac{1}{4} (S_{xx} - S_{yy})^2 + S_{xy}^2 \tag{5.26}$$

$$r \cos 2u_0 = \tfrac{1}{2} (S_{xx} - S_{yy}) \tag{5.27}$$

$$r \sin 2u_0 = S_{xy} \tag{5.28}$$

Figure 1 gives a geometric interpretation to these equations.

Figure 1 - Geometric definition of r and $u_0$

Using these values for r and $u_0$, H becomes

$$H = n \left[ \tfrac{1}{2} (S_{xx} + S_{yy}) + r \cos(2u - 2u_0) \right] \tag{5.29}$$

This is maximized by choosing $u = u_0$. For this value of u, the value of H
becomes

$$H = n \left[ \tfrac{1}{2} (S_{xx} + S_{yy}) + r \right] \tag{5.30}$$

Since $\sum (P_i - A_0)^T (P_i - A_0) = n (S_{xx} + S_{yy})$, the value of G,
eqn (5.15), becomes

$$G = \sum (P_i - A_0)^T (P_i - A_0) - n \left[ \tfrac{1}{2} (S_{xx} + S_{yy}) + r \right] = n \left[ \tfrac{1}{2} (S_{xx} + S_{yy}) - r \right] \tag{5.31}$$

Since $r \ge 0$ and $G \ge 0$, equation (5.31) leads to an obvious definition
for the "quality" of the approximation. These results are summarized in
Theorem 6.
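The claim that $u = u_0$ maximizes H can be checked by brute force. The sketch below evaluates the quadratic form $S_{xx} \cos^2 u + 2 S_{xy} \cos u \sin u + S_{yy} \sin^2 u$ that appears in eqn (5.23) over a grid of angles and compares the largest value with $\tfrac{1}{2}(S_{xx} + S_{yy}) + r$. The function names and the sample statistics are illustrative assumptions, and `math.atan2` is used here as one convenient way to solve eqns (5.27) and (5.28) for $2u_0$.

```python
import math


def quadratic_form(u, s_xx, s_yy, s_xy):
    """Value of D^T [[S_xx, S_xy], [S_xy, S_yy]] D for D = (cos u, sin u)."""
    c, s = math.cos(u), math.sin(u)
    return s_xx * c * c + 2.0 * s_xy * c * s + s_yy * s * s


def best_angle(s_xx, s_yy, s_xy):
    """Return (u0, r) satisfying eqns (5.26), (5.27), and (5.28)."""
    half_diff = 0.5 * (s_xx - s_yy)
    r = math.hypot(half_diff, s_xy)              # eqn (5.26)
    u0 = 0.5 * math.atan2(s_xy, half_diff)       # eqns (5.27), (5.28)
    return u0, r


# Illustrative statistics (not from the report).
s_xx, s_yy, s_xy = 2.0, 1.0, 0.5
u0, r = best_angle(s_xx, s_yy, s_xy)
predicted_max = 0.5 * (s_xx + s_yy) + r

# Scan directions in one-degree steps over a half turn; D and -D give the same line.
scanned_max = max(
    quadratic_form(math.radians(k), s_xx, s_yy, s_xy) for k in range(180)
)
print(u0, predicted_max, scanned_max)
```

The scanned maximum should fall just below the predicted one, since the grid need not contain $u_0$ exactly.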


Theorem 6: The geometrically best linear approximation to the data points
$P_1, P_2, \ldots, P_n$ is given as

$$L = A + Dt$$

where the vector A is the centroid of the data points:

$$A = \sum P_i / n \tag{5.32}$$

and the vector D has components

$$D = \begin{bmatrix} \cos u_0 \\ \sin u_0 \end{bmatrix} \tag{5.33}$$

where $u_0$ satisfies eqns (5.26), (5.27), and (5.28). The quality of this
linear approximation is given by eqn (5.34) where r is defined by eqn (5.26).

$$R = 2r / (S_{xx} + S_{yy}) \tag{5.34}$$


The formula L = A + Dt becomes

$$\begin{aligned} x &= \mu_x + C t \\ y &= \mu_y + S t \end{aligned} \tag{5.35}$$

where $C = \cos(u_0)$ and $S = \sin(u_0)$ are defined by equations (5.26),
(5.27), and (5.28). If the angle $2u_0$ is constrained between $-\pi$ and
$+\pi$, then the formulas for C and S are

$$C = \sqrt{(1 + C_2) / 2} \tag{5.36}$$

and

$$S = \operatorname{sign}(S_{xy}) \sqrt{(1 - C_2) / 2} \tag{5.37}$$

where

$$C_2 = \cos(2u_0) = (S_{xx} - S_{yy}) / (2r) \tag{5.38}$$

and

$$r = \sqrt{\tfrac{1}{4} (S_{xx} - S_{yy})^2 + S_{xy}^2} \tag{5.39}$$
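Eqns (5.34) and (5.36) through (5.39) can be collected into one routine. The sketch below is one possible arrangement, not the report's own program: the name `fit_geometric`, the clamping of $C_2$ against rounding error, and the treatment of $S_{xy} = 0$ as positive are assumptions made for the example.

```python
import math


def fit_geometric(s_xx, s_yy, s_xy):
    """Return (C, S, R): direction cosines of the best line and its quality."""
    r = math.sqrt(0.25 * (s_xx - s_yy) ** 2 + s_xy ** 2)    # eqn (5.39)
    if r == 0.0:
        # S_xx == S_yy and S_xy == 0: no direction is preferred.
        raise ValueError("every line through the centroid is equally good")
    c2 = (s_xx - s_yy) / (2.0 * r)                          # eqn (5.38)
    c2 = max(-1.0, min(1.0, c2))       # guard against rounding past +/-1
    c = math.sqrt((1.0 + c2) / 2.0)                         # eqn (5.36)
    s = math.copysign(math.sqrt((1.0 - c2) / 2.0), s_xy)    # eqn (5.37)
    quality = 2.0 * r / (s_xx + s_yy)                       # eqn (5.34)
    return c, s, quality


# Illustrative statistics (not from the report): S_xx = S_yy = 1.25, S_xy = 0.75.
c, s, quality = fit_geometric(1.25, 1.25, 0.75)
print(c, s, quality)
```

With equal variances the line runs at 45 degrees and the quality, 0.6, equals the magnitude of the correlation coefficient 0.75/1.25.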


6.0  REVIEW OF THE THREE DEFINITIONS

For the purposes of clarity, the main results are copied here. When
the x-axis is treated as the independent variable, the best approximation
is given by eqn (3.13)

$$y = \mu_y + S_{xy} (x - \mu_x) / S_{xx} \tag{3.13}$$

and the "quality" by eqn (3.11).

$$R = S_{xy}^2 / (S_{xx} S_{yy}) \tag{3.11}$$

When the y-axis is treated as the independent variable, the best
approximation is given by eqn (4.1)

$$x = \mu_x + S_{xy} (y - \mu_y) / S_{yy} \tag{4.1}$$

and the "quality" by eqn (4.2)

$$R = S_{xy}^2 / (S_{xx} S_{yy}) \tag{4.2}$$

When neither axis is treated as independent, the best approximation is
given by eqn (5.35)

$$\begin{aligned} x &= \mu_x + C t \\ y &= \mu_y + S t \end{aligned} \tag{5.35}$$

where C and S are defined by eqns (5.36), (5.37), (5.38), and (5.39). The
"quality" of this approximation is given by eqn (5.34).

$$R = 2r / (S_{xx} + S_{yy}) \tag{5.34}$$


In all cases, the values for the "quality" R will be between zero
and one. A value of R near one indicates that the linear approximation
is very good in that the sum of squared residuals will be very nearly
zero. A value of R near zero indicates that the linear approximation is
very poor in that the sum of squared residuals will be large. The value
for R is not dependent on the magnitudes of the data points but only on
how well the data can be approximated by a line. Acceptable values for
R are a function of the type of problem being solved and the required
accuracy of the linear approximation.

The different approximations come about because of different meanings
for the phrase "distance from a point to a line." Figure 2 shows the
three definitions used in this report. When the x-axis is treated as
the independent variable, eqns (3.13) and (3.11), the error distances
are measured parallel to the y-axis, $e_1$ in Figure 2. When the y-axis is
treated as the independent variable, eqns (4.1) and (4.2), the error
distances are measured parallel to the x-axis, $e_2$ in Figure 2. When neither
axis is treated as the independent variable, eqns (5.35) and (5.34), the
error distances are measured perpendicular to the line, $e_3$ in Figure 2.

The choice between these three should be determined by the problem. If
the data are such that one variable is clearly the independent variable
and the other is dependent, then the appropriate formula, eqn (3.13) or
eqn (4.1), should be used. The third method is useful when both variables
are measured in the same units and there is no clear indication of which
is the dependent variable.

Figure 2 - Three Distances From a Point to a Line
Comparing the three approximations is easier when the data have been
translated and normalized such that $\mu_x = \mu_y = 0$ and
$S_{xx} = S_{yy} = 1$ (subtract the mean and divide by the standard
deviation). In this case, the covariance $S_{xy}$ is called the (linear)
correlation coefficient. Eqns (3.13) and (3.11) reduce to $y = S_{xy} x$ and
$R = S_{xy}^2$. The "quality" is the square of the correlation coefficient.
Eqns (4.1) and (4.2) reduce to $x = S_{xy} y$ and $R = S_{xy}^2$. Eqns
(5.35), (5.36), (5.37), (5.38), (5.39), and (5.34) quickly reduce to
$y = \operatorname{sign}(S_{xy}) x$ and $R = |S_{xy}|$ if $S_{xy} \ne 0$.
When $S_{xy} = 0$, there is no best approximation and any line through the
centroid will do as well as any other, with a "quality" of R = 0. Note that,
in this third case, the "quality" is the magnitude of the correlation
coefficient, or the square root of the "quality" for the first two
approximations. Indeed, whenever $S_{xx} = S_{yy}$, this third definition of
"quality" will be the magnitude of the correlation coefficient. The first
two definitions of "quality" are always the square of the correlation
coefficient, so the three definitions of "quality" are best compared when
the third definition has been squared. Note that the slopes for the three
cases, $S_{xy}$, $1/S_{xy}$, and $\operatorname{sign}(S_{xy})$, are equal
only when $S_{xy} = 1$ or $S_{xy} = -1$. In all other cases, these three
approximations lead to three different lines.
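The normalization described above is easy to carry out explicitly. In the sketch below, the helper `standardize` and the four data points are assumptions made for illustration. After normalization, the covariance of the transformed data is the correlation coefficient, and the three slopes can be read off directly.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the (1/n) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sd for v in values]


# Illustrative data only; these points are not taken from the report.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
zx, zy = standardize(xs), standardize(ys)

# S_xy of the normalized data, i.e. the correlation coefficient.
rho = sum(a * b for a, b in zip(zx, zy)) / len(zx)

slope_y_on_x = rho                        # from y = S_xy * x
slope_x_on_y = 1.0 / rho                  # from x = S_xy * y, solved for y
slope_geometric = math.copysign(1.0, rho) # from y = sign(S_xy) * x
print(rho, slope_y_on_x, slope_x_on_y, slope_geometric)
```

For this data the correlation coefficient is 0.6, so the three slopes are 0.6, about 1.67, and 1, with the geometric slope lying between the other two.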
To illustrate the differences between these three linear approximations,
two examples are discussed. The first example is a real application with
the raw data shown in Table 1. The x-values are the lengths of subroutines
(in binary bits) compiled on a CDC-6600 computer. The y-values are the
lengths of these same subroutines (in binary bits) compiled on an IBM-370
computer. Both variables are measured in the same units (bits) and there
is no clear indication which should be the dependent variable. This data,
plotted in Figure 3, shows a strong linear relationship. The various
statistics for this data are:

$$\mu_x = 7908.6 \qquad \mu_y = 8347.4$$

$$S_{xx} = 4.839 \times 10^7 \qquad S_{yy} = 3.816 \times 10^7$$

$$S_{xy} = 4.253 \times 10^7$$

When the x-axis is chosen as the independent variable, the best
linear approximation is

$$y = 1395.3 + 0.879x$$

with a "quality" R = .980. When the y-axis is chosen as the independent
variable, the best linear approximation is

$$x = -1395.0 + 1.115y$$

with a "quality" R = .980. Solving this approximation for y gives

$$y = 1251.7 + 0.897x$$

which is a different line from the first approximation.

TABLE 1  RAW DATA FOR EXAMPLE 1

   #       x       y
   1   13260   13680
   2    1920    3168
   3    3240    4480
   4    2760    3856
   5   11820   11840
   6    4140    4736
   7    3960    4432
   8    9900    7592
   9   10320    8560
  10    1440    2448
  11    3420    4624
  12    1800    3184
  13   19140   18520
  14    6540    7328
  15   16680   17808
  16    3780    4160
  17    7080    7168
  18   14280   13696
  19    5520    3600
  20   31920   27936
  21   24120   23600
  22    1800    2816
  23   10080   10272
  24    2160    3536
  25    1800    3344
  26    1920    4176
  27    6300    6656
  28    6720    6896
  29    2160    3120
  30    4020    4768
  31    8880    9696
  32    6240    8160
  33   17640   16736
  34   10440   10352
  35    7080    8160
  36    1380    2656
  37    4200    6080
  38   22380   21984
  39    3480    5008
  40   10980   12256
  41    3840    4768
  42    1620    2736

Figure 3 - Plot of Raw Data for Example 1 (y versus x)

When neither axis is chosen as the independent variable, the best
linear approximation is

$$\begin{aligned} x &= 7908.6 + 0.748t \\ y &= 8347.4 + 0.664t \end{aligned}$$

with a "quality" R = 0.990 ($R^2$ = .980). This last approximation, when
solved for y in terms of x, gives

$$y = 1332.4 + 0.887x$$

which is a third line. All three lines pass through the centroid
$(\mu_x, \mu_y)$ of the data and have three different slopes (0.879, 0.897,
and 0.887). These slopes differ by very little and the three "qualities"
are very high (.980, .980, and, once the third is squared, .980), indicating
that the linear approximations are very good.
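The slopes and qualities quoted for Example 1 can be recomputed from the printed summary statistics alone. The sketch below does so under the assumption that the rounded statistics given in the text are used as exact inputs, so agreement is expected only to about three figures. The variable names are illustrative.

```python
import math

# Summary statistics for Example 1 as printed in the report (rounded values).
s_xx, s_yy, s_xy = 4.839e7, 3.816e7, 4.253e7

# First definition, eqn (3.13): slope of y on x.
slope_1 = s_xy / s_xx
# Second definition, eqn (4.1), solved for y: the slope is S_yy / S_xy.
slope_2 = s_yy / s_xy
# Third definition, eqns (5.36)-(5.39): the slope is S / C.
r = math.sqrt(0.25 * (s_xx - s_yy) ** 2 + s_xy ** 2)
c2 = (s_xx - s_yy) / (2.0 * r)
c = math.sqrt((1.0 + c2) / 2.0)
s = math.copysign(math.sqrt((1.0 - c2) / 2.0), s_xy)
slope_3 = s / c

quality_12 = s_xy ** 2 / (s_xx * s_yy)    # eqns (3.11) and (4.2)
quality_3 = 2.0 * r / (s_xx + s_yy)       # eqn (5.34)

print(round(slope_1, 3), round(slope_2, 3), round(slope_3, 3))
print(round(c, 3), round(s, 3), round(quality_12, 3), round(quality_3, 3))
```

The geometric slope falls between the other two, as in the text.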

The second example is contrived to show how different the three
approximations can be. The raw data is given in Table 2. The various
statistics for this data are:

$$\mu_x = 0.5 \qquad \mu_y = 0.15$$

$$S_{xx} = 0.1 \qquad S_{yy} = 0.0078$$

and

$$S_{xy} = 0$$

When the x-axis is chosen as the independent variable, the best linear
approximation is

$$y = 0.15$$

with a "quality" R = 0. This is a horizontal line!

When the y-axis is chosen as the independent variable, the best
linear approximation is

$$x = 0.5$$

with a "quality" R = 0. This is a vertical line!

When neither axis is chosen as the independent variable, the best
linear approximation is

$$\begin{aligned} x &= 0.5 + t \\ y &= 0.15 \end{aligned}$$

and the "quality" is R = .85 ($R^2$ = .72). This is the same horizontal line
as the first linear approximation!


TABLE 2  RAW DATA FOR EXAMPLE 2

   #     x     y         #     x     y
   1    .0   .00         7    .6   .24
   2    .1   .09         8    .7   .21
   3    .2   .16         9    .8   .16
   4    .3   .21        10    .9   .09
   5    .4   .24        11   1.0   .00
   6    .5   .25
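The statistics quoted for Example 2 can be reproduced from Table 2 with a single pass over the data, accumulating raw sums and then applying identities (2.6) through (2.10). The sketch below is an assumption-laden illustration rather than the report's program: the function name is hypothetical, and the one-pass formulas can lose precision through cancellation when the means are large compared with the spread of the data.

```python
import math

# Table 2 of the report: (x, y) pairs.
TABLE_2 = [
    (0.0, 0.00), (0.1, 0.09), (0.2, 0.16), (0.3, 0.21), (0.4, 0.24),
    (0.5, 0.25), (0.6, 0.24), (0.7, 0.21), (0.8, 0.16), (0.9, 0.09),
    (1.0, 0.00),
]


def stats_from_running_sums(points):
    """One pass over the data, then identities (2.6)-(2.10) solved for the statistics."""
    n = sx = sxx = sy = syy = sxy = 0.0
    for x, y in points:
        n += 1.0
        sx += x
        sxx += x * x
        sy += y
        syy += y * y
        sxy += x * y
    mu_x = sx / n                          # from (2.6)
    mu_y = sy / n                          # from (2.8)
    s_xx = sxx / n - mu_x * mu_x           # from (2.7)
    s_yy = syy / n - mu_y * mu_y           # from (2.9)
    s_xy = sxy / n - mu_x * mu_y           # from (2.10)
    return mu_x, mu_y, s_xx, s_yy, s_xy


mu_x, mu_y, s_xx, s_yy, s_xy = stats_from_running_sums(TABLE_2)
r = math.sqrt(0.25 * (s_xx - s_yy) ** 2 + s_xy ** 2)    # eqn (5.39)
quality_geometric = 2.0 * r / (s_xx + s_yy)             # eqn (5.34)
print(mu_x, mu_y, s_xx, s_yy, s_xy, quality_geometric)
```

Because $S_{xy}$ comes out as zero only up to rounding, a comparison against zero should use a tolerance rather than an exact test.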
APPENDIX - PROGRAM LISTING


10  REM "COMPUTE BEST LINEAR APPROXIMATIONS"
20  REM "CODED IN HP-9830 BASIC BY JAMES HURT 31 MAR 76"
30  DIM S[6]
40  REM "ZERO SUMS"
50  MAT S=ZER
60  REM "ENTER DATA"
70  DISP S[1];
80  INPUT X,Y
90  REM "TEST FOR END OF DATA"
100 IF X<0 THEN 210
110 REM "UPDATE SUMS"
120 S[1]=S[1]+1
130 S[2]=S[2]+X
140 S[3]=S[3]+X*X
150 S[4]=S[4]+Y
160 S[5]=S[5]+Y*Y
170 S[6]=S[6]+X*Y
180 REM "PRINT DATA"
190 PRINT S[1];X;Y
200 GOTO 60
210 REM "COMPUTE STATISTICS"
220 S[2]=S[2]/S[1]
230 S[3]=S[3]/S[1]-S[2]*S[2]
240 S[4]=S[4]/S[1]
250 S[5]=S[5]/S[1]-S[4]*S[4]
260 S[6]=S[6]/S[1]-S[2]*S[4]
270 REM "PRINT STATISTICS"
280 PRINT
290 PRINT S[1];"POINTS"
300 PRINT "MEANS";S[2];S[4]
310 PRINT "VARIANCES";S[3];S[5];S[6]
320 REM "COMPUTE STANDARD DEVIATIONS"
330 X=SQR(S[3])
340 Y=SQR(S[5])
350 REM "CORRELATION COEFFICIENT"
360 R=S[6]/(X*Y)
370 PRINT "STD. DEV.";X;Y;R
380 REM "X IS INDEPENDENT"
390 X=S[6]/S[3]
400 Y=S[4]-X*S[2]
410 PRINT "Y=";Y;"+X*";X
420 REM "Y IS INDEPENDENT"
430 X=S[6]/S[5]
440 Y=S[2]-X*S[4]
450 PRINT "X=";Y;"+Y*";X
460 REM "QUALITY"
470 R=S[6]*S[6]/(S[3]*S[5])
480 PRINT "QUALITY=";R
490 REM "C2=COS(2*U0)"
500 X=(S[3]-S[5])/2
510 R=SQR(X*X+S[6]*S[6])
520 C2=X/R
530 REM "X=COS(U0) AND Y=SIN(U0)"
540 X=SQR((1+C2)/2)
550 Y=SQR((1-C2)/2)
560 IF S[6]>=0 THEN 580
570 Y=-Y
580 PRINT
590 PRINT "X=";S[2];"+T*";X
600 PRINT "Y=";S[4];"+T*";Y
610 REM "QUALITY"
620 R=2*R/(S[3]+S[5])
630 PRINT "QUALITY=";R
640 PRINT
650 PRINT
660 PRINT
670 END